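#load packages (assumed from the functions used below) and read in data
library(data.table); library(dplyr); library(ggplot2); library(leaflet)
#note: the object names are swapped relative to the file years (air2002 holds the 2022 file)
air2002 <- fread("ad_viz_plotval_data_2022.csv")
air2022 <- fread("ad_viz_plotval_data_2002.csv")
Assignment_1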
Due Date
This assignment is due by 11:59pm Pacific Time, September 27th, 2024.
Learning Goals
- Download, read, and get familiar with an external dataset.
- Step through the EDA “checklist” presented in class
- Practice making exploratory plots
Assignment Description
We will work with air pollution data from the U.S. Environmental Protection Agency (EPA). The EPA has a national monitoring network of air pollution sites. The primary question you will answer is whether daily concentrations of PM\(_{2.5}\) (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).
A primer on particulate matter air pollution can be found here.
Your assignment should be completed in Quarto or R Markdown.
Steps
- Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.
#observe data
dim(air2002)
[1] 59756 22
head(air2002)
Date Source Site ID POC Daily Mean PM2.5 Concentration Units
<char> <char> <int> <int> <num> <char>
1: 01/01/2022 AQS 60010007 3 12.7 ug/m3 LC
2: 01/02/2022 AQS 60010007 3 13.9 ug/m3 LC
3: 01/03/2022 AQS 60010007 3 7.1 ug/m3 LC
4: 01/04/2022 AQS 60010007 3 3.7 ug/m3 LC
5: 01/05/2022 AQS 60010007 3 4.2 ug/m3 LC
6: 01/06/2022 AQS 60010007 3 3.8 ug/m3 LC
Daily AQI Value Local Site Name Daily Obs Count Percent Complete
<int> <char> <int> <num>
1: 58 Livermore 1 100
2: 60 Livermore 1 100
3: 39 Livermore 1 100
4: 21 Livermore 1 100
5: 23 Livermore 1 100
6: 21 Livermore 1 100
AQS Parameter Code AQS Parameter Description Method Code
<int> <char> <int>
1: 88101 PM2.5 - Local Conditions 170
2: 88101 PM2.5 - Local Conditions 170
3: 88101 PM2.5 - Local Conditions 170
4: 88101 PM2.5 - Local Conditions 170
5: 88101 PM2.5 - Local Conditions 170
6: 88101 PM2.5 - Local Conditions 170
Method Description CBSA Code
<char> <int>
1: Met One BAM-1020 Mass Monitor w/VSCC 41860
2: Met One BAM-1020 Mass Monitor w/VSCC 41860
3: Met One BAM-1020 Mass Monitor w/VSCC 41860
4: Met One BAM-1020 Mass Monitor w/VSCC 41860
5: Met One BAM-1020 Mass Monitor w/VSCC 41860
6: Met One BAM-1020 Mass Monitor w/VSCC 41860
CBSA Name State FIPS Code State
<char> <int> <char>
1: San Francisco-Oakland-Hayward, CA 6 California
2: San Francisco-Oakland-Hayward, CA 6 California
3: San Francisco-Oakland-Hayward, CA 6 California
4: San Francisco-Oakland-Hayward, CA 6 California
5: San Francisco-Oakland-Hayward, CA 6 California
6: San Francisco-Oakland-Hayward, CA 6 California
County FIPS Code County Site Latitude Site Longitude
<int> <char> <num> <num>
1: 1 Alameda 37.68753 -121.7842
2: 1 Alameda 37.68753 -121.7842
3: 1 Alameda 37.68753 -121.7842
4: 1 Alameda 37.68753 -121.7842
5: 1 Alameda 37.68753 -121.7842
6: 1 Alameda 37.68753 -121.7842
tail(air2002)
Date Source Site ID POC Daily Mean PM2.5 Concentration Units
<char> <char> <int> <int> <num> <char>
1: 12/01/2022 AQS 61131003 1 3.4 ug/m3 LC
2: 12/07/2022 AQS 61131003 1 3.8 ug/m3 LC
3: 12/13/2022 AQS 61131003 1 6.0 ug/m3 LC
4: 12/19/2022 AQS 61131003 1 34.8 ug/m3 LC
5: 12/25/2022 AQS 61131003 1 23.2 ug/m3 LC
6: 12/31/2022 AQS 61131003 1 1.0 ug/m3 LC
Daily AQI Value Local Site Name Daily Obs Count Percent Complete
<int> <char> <int> <num>
1: 19 Woodland-Gibson Road 1 100
2: 21 Woodland-Gibson Road 1 100
3: 33 Woodland-Gibson Road 1 100
4: 99 Woodland-Gibson Road 1 100
5: 77 Woodland-Gibson Road 1 100
6: 6 Woodland-Gibson Road 1 100
AQS Parameter Code AQS Parameter Description Method Code
<int> <char> <int>
1: 88101 PM2.5 - Local Conditions 145
2: 88101 PM2.5 - Local Conditions 145
3: 88101 PM2.5 - Local Conditions 145
4: 88101 PM2.5 - Local Conditions 145
5: 88101 PM2.5 - Local Conditions 145
6: 88101 PM2.5 - Local Conditions 145
Method Description CBSA Code
<char> <int>
1: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
2: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
3: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
4: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
5: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
6: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
CBSA Name State FIPS Code State
<char> <int> <char>
1: Sacramento--Roseville--Arden-Arcade, CA 6 California
2: Sacramento--Roseville--Arden-Arcade, CA 6 California
3: Sacramento--Roseville--Arden-Arcade, CA 6 California
4: Sacramento--Roseville--Arden-Arcade, CA 6 California
5: Sacramento--Roseville--Arden-Arcade, CA 6 California
6: Sacramento--Roseville--Arden-Arcade, CA 6 California
County FIPS Code County Site Latitude Site Longitude
<int> <char> <num> <num>
1: 113 Yolo 38.66121 -121.7327
2: 113 Yolo 38.66121 -121.7327
3: 113 Yolo 38.66121 -121.7327
4: 113 Yolo 38.66121 -121.7327
5: 113 Yolo 38.66121 -121.7327
6: 113 Yolo 38.66121 -121.7327
colnames(air2002)
[1] "Date" "Source"
[3] "Site ID" "POC"
[5] "Daily Mean PM2.5 Concentration" "Units"
[7] "Daily AQI Value" "Local Site Name"
[9] "Daily Obs Count" "Percent Complete"
[11] "AQS Parameter Code" "AQS Parameter Description"
[13] "Method Code" "Method Description"
[15] "CBSA Code" "CBSA Name"
[17] "State FIPS Code" "State"
[19] "County FIPS Code" "County"
[21] "Site Latitude" "Site Longitude"
str(air2002)
Classes 'data.table' and 'data.frame': 59756 obs. of 22 variables:
$ Date : chr "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
$ Source : chr "AQS" "AQS" "AQS" "AQS" ...
$ Site ID : int 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
$ POC : int 3 3 3 3 3 3 3 3 3 3 ...
$ Daily Mean PM2.5 Concentration: num 12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
$ Units : chr "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
$ Daily AQI Value : int 58 60 39 21 23 21 13 38 59 55 ...
$ Local Site Name : chr "Livermore" "Livermore" "Livermore" "Livermore" ...
$ Daily Obs Count : int 1 1 1 1 1 1 1 1 1 1 ...
$ Percent Complete : num 100 100 100 100 100 100 100 100 100 100 ...
$ AQS Parameter Code : int 88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
$ AQS Parameter Description : chr "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
$ Method Code : int 170 170 170 170 170 170 170 170 170 170 ...
$ Method Description : chr "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
$ CBSA Code : int 41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
$ CBSA Name : chr "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
$ State FIPS Code : int 6 6 6 6 6 6 6 6 6 6 ...
$ State : chr "California" "California" "California" "California" ...
$ County FIPS Code : int 1 1 1 1 1 1 1 1 1 1 ...
$ County : chr "Alameda" "Alameda" "Alameda" "Alameda" ...
$ Site Latitude : num 37.7 37.7 37.7 37.7 37.7 ...
$ Site Longitude : num -122 -122 -122 -122 -122 ...
- attr(*, ".internal.selfref")=<externalptr>
summary(air2002)
Date Source Site ID POC
Length:59756 Length:59756 Min. :60010007 Min. : 1.00
Class :character Class :character 1st Qu.:60290019 1st Qu.: 1.00
Mode :character Mode :character Median :60631006 Median : 3.00
Mean :60563315 Mean : 3.77
3rd Qu.:60731026 3rd Qu.: 3.00
Max. :61131003 Max. :24.00
Daily Mean PM2.5 Concentration Units Daily AQI Value
Min. : -6.700 Length:59756 Min. : 0.00
1st Qu.: 4.100 Class :character 1st Qu.: 23.00
Median : 6.800 Mode :character Median : 38.00
Mean : 8.428 Mean : 39.28
3rd Qu.: 10.700 3rd Qu.: 54.00
Max. :302.500 Max. :454.00
Local Site Name Daily Obs Count Percent Complete AQS Parameter Code
Length:59756 Min. :1 Min. :100 Min. :88101
Class :character 1st Qu.:1 1st Qu.:100 1st Qu.:88101
Mode :character Median :1 Median :100 Median :88101
Mean :1 Mean :100 Mean :88192
3rd Qu.:1 3rd Qu.:100 3rd Qu.:88101
Max. :1 Max. :100 Max. :88502
AQS Parameter Description Method Code Method Description CBSA Code
Length:59756 Min. :143 Length:59756 Min. :12540
Class :character 1st Qu.:170 Class :character 1st Qu.:31080
Mode :character Median :170 Mode :character Median :40140
Mean :336 Mean :34957
3rd Qu.:707 3rd Qu.:41860
Max. :810 Max. :49700
NA's :4567
CBSA Name State FIPS Code State County FIPS Code
Length:59756 Min. :6 Length:59756 Min. : 1.00
Class :character 1st Qu.:6 Class :character 1st Qu.: 29.00
Mode :character Median :6 Mode :character Median : 63.00
Mean :6 Mean : 56.19
3rd Qu.:6 3rd Qu.: 73.00
Max. :6 Max. :113.00
County Site Latitude Site Longitude
Length:59756 Min. :32.58 Min. :-124.2
Class :character 1st Qu.:34.07 1st Qu.:-121.4
Mode :character Median :36.49 Median :-119.6
Mean :36.24 Mean :-119.6
3rd Qu.:37.96 3rd Qu.:-117.9
Max. :41.76 Max. :-115.5
colSums(is.na(air2002))
Date Source
0 0
Site ID POC
0 0
Daily Mean PM2.5 Concentration Units
0 0
Daily AQI Value Local Site Name
0 0
Daily Obs Count Percent Complete
0 0
AQS Parameter Code AQS Parameter Description
0 0
Method Code Method Description
0 0
CBSA Code CBSA Name
4567 0
State FIPS Code State
0 0
County FIPS Code County
0 0
Site Latitude Site Longitude
0 0
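Dimensions: 59756 rows x 22 columns. Column names (type): Date (chr), Source (chr), Site ID (int), POC (int), Daily Mean PM2.5 Concentration (num), Units (chr), Daily AQI Value (int), Local Site Name (chr), Daily Obs Count (int), Percent Complete (num), AQS Parameter Code (int), AQS Parameter Description (chr), Method Code (int), Method Description (chr), CBSA Code (int), CBSA Name (chr), State FIPS Code (int), State (chr), County FIPS Code (int), County (chr), Site Latitude (num), Site Longitude (num). There are no NAs in the key variable, Daily Mean PM2.5 Concentration, or in any other column except CBSA Code, which has 4567 NAs. The minimum PM2.5 value is -6.7, which is implausible for a concentration. Note that the dates in this object are from 2022.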
dim(air2022)
[1] 15976 22
head(air2022)
Date Source Site ID POC Daily Mean PM2.5 Concentration Units
<char> <char> <int> <int> <num> <char>
1: 01/05/2002 AQS 60010007 1 25.1 ug/m3 LC
2: 01/06/2002 AQS 60010007 1 31.6 ug/m3 LC
3: 01/08/2002 AQS 60010007 1 21.4 ug/m3 LC
4: 01/11/2002 AQS 60010007 1 25.9 ug/m3 LC
5: 01/14/2002 AQS 60010007 1 34.5 ug/m3 LC
6: 01/17/2002 AQS 60010007 1 41.0 ug/m3 LC
Daily AQI Value Local Site Name Daily Obs Count Percent Complete
<int> <char> <int> <num>
1: 81 Livermore 1 100
2: 93 Livermore 1 100
3: 74 Livermore 1 100
4: 82 Livermore 1 100
5: 98 Livermore 1 100
6: 115 Livermore 1 100
AQS Parameter Code AQS Parameter Description Method Code
<int> <char> <int>
1: 88101 PM2.5 - Local Conditions 120
2: 88101 PM2.5 - Local Conditions 120
3: 88101 PM2.5 - Local Conditions 120
4: 88101 PM2.5 - Local Conditions 120
5: 88101 PM2.5 - Local Conditions 120
6: 88101 PM2.5 - Local Conditions 120
Method Description CBSA Code
<char> <int>
1: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
2: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
3: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
4: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
5: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
6: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
CBSA Name State FIPS Code State
<char> <int> <char>
1: San Francisco-Oakland-Hayward, CA 6 California
2: San Francisco-Oakland-Hayward, CA 6 California
3: San Francisco-Oakland-Hayward, CA 6 California
4: San Francisco-Oakland-Hayward, CA 6 California
5: San Francisco-Oakland-Hayward, CA 6 California
6: San Francisco-Oakland-Hayward, CA 6 California
County FIPS Code County Site Latitude Site Longitude
<int> <char> <num> <num>
1: 1 Alameda 37.68753 -121.7842
2: 1 Alameda 37.68753 -121.7842
3: 1 Alameda 37.68753 -121.7842
4: 1 Alameda 37.68753 -121.7842
5: 1 Alameda 37.68753 -121.7842
6: 1 Alameda 37.68753 -121.7842
colnames(air2022)
[1] "Date" "Source"
[3] "Site ID" "POC"
[5] "Daily Mean PM2.5 Concentration" "Units"
[7] "Daily AQI Value" "Local Site Name"
[9] "Daily Obs Count" "Percent Complete"
[11] "AQS Parameter Code" "AQS Parameter Description"
[13] "Method Code" "Method Description"
[15] "CBSA Code" "CBSA Name"
[17] "State FIPS Code" "State"
[19] "County FIPS Code" "County"
[21] "Site Latitude" "Site Longitude"
tail(air2022)
Date Source Site ID POC Daily Mean PM2.5 Concentration Units
<char> <char> <int> <int> <num> <char>
1: 12/10/2002 AQS 61131003 1 15 ug/m3 LC
2: 12/13/2002 AQS 61131003 1 15 ug/m3 LC
3: 12/22/2002 AQS 61131003 1 1 ug/m3 LC
4: 12/25/2002 AQS 61131003 1 23 ug/m3 LC
5: 12/28/2002 AQS 61131003 1 5 ug/m3 LC
6: 12/31/2002 AQS 61131003 1 6 ug/m3 LC
Daily AQI Value Local Site Name Daily Obs Count Percent Complete
<int> <char> <int> <num>
1: 62 Woodland-Gibson Road 1 100
2: 62 Woodland-Gibson Road 1 100
3: 6 Woodland-Gibson Road 1 100
4: 77 Woodland-Gibson Road 1 100
5: 28 Woodland-Gibson Road 1 100
6: 33 Woodland-Gibson Road 1 100
AQS Parameter Code AQS Parameter Description Method Code
<int> <char> <int>
1: 88101 PM2.5 - Local Conditions 117
2: 88101 PM2.5 - Local Conditions 117
3: 88101 PM2.5 - Local Conditions 117
4: 88101 PM2.5 - Local Conditions 117
5: 88101 PM2.5 - Local Conditions 117
6: 88101 PM2.5 - Local Conditions 117
Method Description CBSA Code
<char> <int>
1: R & P Model 2000 PM2.5 Sampler w/WINS 40900
2: R & P Model 2000 PM2.5 Sampler w/WINS 40900
3: R & P Model 2000 PM2.5 Sampler w/WINS 40900
4: R & P Model 2000 PM2.5 Sampler w/WINS 40900
5: R & P Model 2000 PM2.5 Sampler w/WINS 40900
6: R & P Model 2000 PM2.5 Sampler w/WINS 40900
CBSA Name State FIPS Code State
<char> <int> <char>
1: Sacramento--Roseville--Arden-Arcade, CA 6 California
2: Sacramento--Roseville--Arden-Arcade, CA 6 California
3: Sacramento--Roseville--Arden-Arcade, CA 6 California
4: Sacramento--Roseville--Arden-Arcade, CA 6 California
5: Sacramento--Roseville--Arden-Arcade, CA 6 California
6: Sacramento--Roseville--Arden-Arcade, CA 6 California
County FIPS Code County Site Latitude Site Longitude
<int> <char> <num> <num>
1: 113 Yolo 38.66121 -121.7327
2: 113 Yolo 38.66121 -121.7327
3: 113 Yolo 38.66121 -121.7327
4: 113 Yolo 38.66121 -121.7327
5: 113 Yolo 38.66121 -121.7327
6: 113 Yolo 38.66121 -121.7327
str(air2022)
Classes 'data.table' and 'data.frame': 15976 obs. of 22 variables:
$ Date : chr "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
$ Source : chr "AQS" "AQS" "AQS" "AQS" ...
$ Site ID : int 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
$ POC : int 1 1 1 1 1 1 1 1 1 1 ...
$ Daily Mean PM2.5 Concentration: num 25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
$ Units : chr "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
$ Daily AQI Value : int 81 93 74 82 98 115 89 62 69 107 ...
$ Local Site Name : chr "Livermore" "Livermore" "Livermore" "Livermore" ...
$ Daily Obs Count : int 1 1 1 1 1 1 1 1 1 1 ...
$ Percent Complete : num 100 100 100 100 100 100 100 100 100 100 ...
$ AQS Parameter Code : int 88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
$ AQS Parameter Description : chr "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
$ Method Code : int 120 120 120 120 120 120 120 120 120 120 ...
$ Method Description : chr "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" ...
$ CBSA Code : int 41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
$ CBSA Name : chr "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
$ State FIPS Code : int 6 6 6 6 6 6 6 6 6 6 ...
$ State : chr "California" "California" "California" "California" ...
$ County FIPS Code : int 1 1 1 1 1 1 1 1 1 1 ...
$ County : chr "Alameda" "Alameda" "Alameda" "Alameda" ...
$ Site Latitude : num 37.7 37.7 37.7 37.7 37.7 ...
$ Site Longitude : num -122 -122 -122 -122 -122 ...
- attr(*, ".internal.selfref")=<externalptr>
summary(air2022)
Date Source Site ID POC
Length:15976 Length:15976 Min. :60010007 Min. :1.000
Class :character Class :character 1st Qu.:60290014 1st Qu.:1.000
Mode :character Mode :character Median :60590007 Median :1.000
Mean :60549600 Mean :1.581
3rd Qu.:60731002 3rd Qu.:1.000
Max. :61131003 Max. :6.000
Daily Mean PM2.5 Concentration Units Daily AQI Value
Min. : 0.00 Length:15976 Min. : 0.00
1st Qu.: 7.00 Class :character 1st Qu.: 39.00
Median : 12.00 Mode :character Median : 56.00
Mean : 16.12 Mean : 59.28
3rd Qu.: 20.50 3rd Qu.: 72.00
Max. :104.30 Max. :185.00
Local Site Name Daily Obs Count Percent Complete AQS Parameter Code
Length:15976 Min. :1 Min. :100 Min. :88101
Class :character 1st Qu.:1 1st Qu.:100 1st Qu.:88101
Mode :character Median :1 Median :100 Median :88101
Mean :1 Mean :100 Mean :88215
3rd Qu.:1 3rd Qu.:100 3rd Qu.:88502
Max. :1 Max. :100 Max. :88502
AQS Parameter Description Method Code Method Description CBSA Code
Length:15976 Min. :117 Length:15976 Min. :12540
Class :character 1st Qu.:120 Class :character 1st Qu.:23420
Mode :character Median :120 Mode :character Median :40140
Mean :297 Mean :33270
3rd Qu.:707 3rd Qu.:41740
Max. :810 Max. :49700
NA's :929
CBSA Name State FIPS Code State County FIPS Code
Length:15976 Min. :6 Length:15976 Min. : 1.00
Class :character 1st Qu.:6 Class :character 1st Qu.: 29.00
Mode :character Median :6 Mode :character Median : 59.00
Mean :6 Mean : 54.78
3rd Qu.:6 3rd Qu.: 73.00
Max. :6 Max. :113.00
County Site Latitude Site Longitude
Length:15976 Min. :32.63 Min. :-124.2
Class :character 1st Qu.:34.07 1st Qu.:-121.4
Mode :character Median :35.36 Median :-119.1
Mean :36.00 Mean :-119.4
3rd Qu.:37.77 3rd Qu.:-117.9
Max. :41.71 Max. :-115.5
colSums(is.na(air2022))
Date Source
0 0
Site ID POC
0 0
Daily Mean PM2.5 Concentration Units
0 0
Daily AQI Value Local Site Name
0 0
Daily Obs Count Percent Complete
0 0
AQS Parameter Code AQS Parameter Description
0 0
Method Code Method Description
0 0
CBSA Code CBSA Name
929 0
State FIPS Code State
0 0
County FIPS Code County
0 0
Site Latitude Site Longitude
0 0
Dimensions: 15976 rows x 22 columns. Column names (type): Date (chr), Source (chr), Site ID (int), POC (int), Daily Mean PM2.5 Concentration (num), Units (chr), Daily AQI Value (int), Local Site Name (chr), Daily Obs Count (int), Percent Complete (num), AQS Parameter Code (int), AQS Parameter Description (chr), Method Code (int), Method Description (chr), CBSA Code (int), CBSA Name (chr), State FIPS Code (int), State (chr), County FIPS Code (int), County (chr), Site Latitude (num), Site Longitude (num). There are no NAs in the key variable, Daily Mean PM2.5 Concentration, or in any other column except CBSA Code, which has 929 NAs. PM2.5 values range from 0 to 104.3. Note that the dates in this object are from 2002.
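The per-dataset checks above can be gathered into one small helper. The following is a minimal base-R sketch and not the code used in this assignment: the function name check_pm and the toy data frame are hypothetical, and only the column names "Daily Mean PM2.5 Concentration" and "CBSA Code" come from the data.

```r
# Hypothetical helper: summarize basic data issues in one data frame.
check_pm <- function(df, pm_col = "Daily Mean PM2.5 Concentration") {
  pm <- df[[pm_col]]
  list(
    n_rows = nrow(df),
    n_cols = ncol(df),
    na_per_column = colSums(is.na(df)),        # NA count for every column
    n_missing_pm = sum(is.na(pm)),             # NAs in the key variable
    n_negative_pm = sum(pm < 0, na.rm = TRUE)  # implausible negative readings
  )
}

# Toy data, made up for illustration (not EPA values)
toy <- data.frame(
  Date = c("01/01/2022", "01/02/2022", "01/03/2022", "01/04/2022"),
  `Daily Mean PM2.5 Concentration` = c(12.7, -1.5, NA, 3.7),
  `CBSA Code` = c(41860L, NA, 41860L, NA),
  check.names = FALSE
)

res <- check_pm(toy)
print(res)
```

Calling the same helper on each year's table would give the two summaries side by side, assuming both tables share the column names shown above.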
- Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
air <- rbind(air2002, air2022)
summary(air$Date)
Length Class Mode
75732 character character
air <- air %>%
  mutate(Year = year(as.Date(Date, format = "%m/%d/%Y"))) %>%
  rename(
    PM2.5 = "Daily Mean PM2.5 Concentration",
    lat = "Site Latitude",
    long = "Site Longitude",
    site = "Local Site Name"
  )
- Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
pal <- colorFactor(c("lightgreen", "purple"), domain = unique(air$Year))
# Create a leaflet map
leaflet(data = air) %>%
  addTiles() %>%
  addCircleMarkers(
    ~long, ~lat,
    color = ~pal(Year),
    radius = 5,
    fillOpacity = 0.7
  ) %>%
  addLegend("bottomright", pal = pal, values = ~Year,
            title = "Year",
            opacity = 1)
The map shows monitoring sites throughout California that reported data in 2002 and in 2022.
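To back up a summary of the spatial distribution with numbers, it can help to reduce the data to one row per site and year before counting or mapping. The sketch below uses base R and a made-up stand-in for the combined table (air_toy, with hypothetical site names and coordinates); it assumes the renamed columns site, lat, long and Year.

```r
# Toy stand-in for the combined data (site names and coordinates are made up)
air_toy <- data.frame(
  site = c("A", "A", "B", "B", "C", "A", "C"),
  lat  = c(34.1, 34.1, 36.5, 36.5, 38.7, 34.1, 38.7),
  long = c(-118.2, -118.2, -119.6, -119.6, -121.7, -118.2, -121.7),
  Year = c(2002, 2002, 2002, 2022, 2022, 2022, 2022)
)

# One row per site and year, so each monitor is counted (or drawn) once per year
sites <- unique(air_toy[, c("site", "lat", "long", "Year")])

# Number of distinct sites reporting in each year
sites_per_year <- table(sites$Year)
print(sites_per_year)

# Sites that report in both years
both <- intersect(sites$site[sites$Year == 2002],
                  sites$site[sites$Year == 2022])
print(both)
```

Passing a deduplicated table like sites to leaflet() instead of the full data would also avoid drawing one marker per daily observation.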
sum <- air %>%
  group_by(Year) %>%
  summarize(
    Mean = mean(PM2.5, na.rm = TRUE),
    Median = median(PM2.5, na.rm = TRUE),
    Min = min(PM2.5, na.rm = TRUE),
    Max = max(PM2.5, na.rm = TRUE),
    Count = n()
  )
print(sum)
# A tibble: 2 × 6
Year Mean Median Min Max Count
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 2002 16.1 12 0 104. 15976
2 2022 8.43 6.8 -6.7 302. 59756
For 2002: Mean: 16.12, Min: 0, Max: 104.3, Count: 15976. For 2022: Mean: 8.43, Min: -6.7, Max: 302.5, Count: 59756. 2002 had a higher mean PM2.5 concentration, 2022 had a higher maximum PM2.5 value, and there are more data points from 2022.
- Check for any missing or implausible values of PM\(_{2.5}\) in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
sum(is.na(air$PM2.5))
[1] 0
ggplot(data = air, aes(x = as.factor(Year), y = PM2.5, fill = as.factor(Year))) +
  geom_boxplot(outlier.colour = "lightblue") +
  labs(title = "Box Plot of PM2.5 by Year",
       x = "Year",
       y = "PM2.5 Concentration (µg/m³)")
sum <- air %>%
  group_by(Year) %>%
  summarize(
    Mean = mean(PM2.5, na.rm = TRUE),
    Median = median(PM2.5, na.rm = TRUE),
    Min = min(PM2.5, na.rm = TRUE),
    Max = max(PM2.5, na.rm = TRUE)
  )
print(sum)
# A tibble: 2 × 5
Year Mean Median Min Max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2002 16.1 12 0 104.
2 2022 8.43 6.8 -6.7 302.
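There are no missing PM2.5 values. The minimum PM2.5 value in 2022 is negative (-6.7), which is implausible for a concentration. 2022 also has a very high maximum (302.5), but because there are multiple high values it is possibly valid. The mean PM2.5 has also roughly halved since 2002 (from 16.1 to 8.43).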
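The step also asks for the proportion of implausible values and any temporal pattern in them. One way to approach this, sketched here in base R on made-up values (air_toy is hypothetical, and treating only negative readings as implausible is an assumption), is to flag negative readings and average the flag by year and by month.

```r
# Toy stand-in for the combined data (values are made up)
air_toy <- data.frame(
  Date = c("01/05/2002", "07/10/2002", "01/03/2022", "01/20/2022",
           "07/04/2022", "07/15/2022"),
  PM2.5 = c(25.1, 12.0, -1.2, 4.0, 6.8, -0.5)
)

# Parse the dates and pull out year and month
d <- as.Date(air_toy$Date, format = "%m/%d/%Y")
air_toy$Year <- as.integer(format(d, "%Y"))
air_toy$Month <- as.integer(format(d, "%m"))

# Flag implausible (negative) concentrations
air_toy$negative <- air_toy$PM2.5 < 0

# Share of negative readings in each year
prop_by_year <- tapply(air_toy$negative, air_toy$Year, mean)
print(prop_by_year)

# Share of negative readings by year and month, to look for seasonal patterns
prop_by_month <- aggregate(negative ~ Year + Month, data = air_toy, FUN = mean)
print(prop_by_month)
```

The mean of a logical vector is the proportion of TRUE values, which is why no separate count and division step is needed.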
Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.
- state
- county
- sites in Los Angeles
ggplot(air, aes(x = State, y = PM2.5, fill = as.factor(Year))) +
  geom_boxplot(na.rm = TRUE) +
  labs(x = "State",
       y = "PM2.5 Concentration (µg/m³)",
       title = "Box Plot of PM2.5 by State") +
  theme_minimal()
Based on this box plot, you can see that 2022 has higher maximum values while 2002 has a higher median.
ggplot(air, aes(x = State, y = PM2.5, fill = as.factor(Year))) +
  geom_violin(na.rm = TRUE) +
  labs(x = "State",
       y = "PM2.5 Concentration (µg/m³)",
       title = "Violin Plot of PM2.5 by State") +
  theme_minimal()
Based on this violin plot, you can see that 2022 has higher maximum values while 2002 is centered on higher concentrations. You can also see that the 2002 values are more concentrated around the center of their distribution.
ggplot(air, aes(x = County, y = PM2.5, fill = as.factor(Year))) +
  geom_bar(stat = "summary", fun = mean, position = "dodge", na.rm = TRUE) +
  labs(x = "County",
       y = "Average PM2.5 Concentration (µg/m³)",
       title = "Average PM2.5 by County") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
This bar graph shows the difference in average PM2.5 by year in the various counties. You can observe that 2002 has higher averages than 2022 in all counties.
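A claim such as "higher in all counties" is easier to verify from a table than from a bar chart. As a rough base-R sketch on made-up numbers (air_toy is hypothetical; the column names County, Year and PM2.5 follow the renamed data), the county means can be laid out with one column per year and differenced.

```r
# Toy stand-in for the combined data (values are made up)
air_toy <- data.frame(
  County = c("Alameda", "Alameda", "Alameda", "Alameda",
             "Yolo", "Yolo", "Yolo", "Yolo"),
  Year = c(2002, 2002, 2022, 2022, 2002, 2002, 2022, 2022),
  PM2.5 = c(20, 30, 8, 12, 15, 5, 6, 4)
)

# Mean PM2.5 with counties as rows and years as columns
county_means <- tapply(air_toy$PM2.5,
                       list(air_toy$County, air_toy$Year),
                       mean)
print(county_means)

# Change from 2002 to 2022; a negative value is a decrease
change <- county_means[, "2022"] - county_means[, "2002"]
print(change)

# TRUE only if every county with data in both years decreased
all_decreased <- all(change < 0, na.rm = TRUE)
print(all_decreased)
```

Counties with data in only one year would show NA in change and are skipped by na.rm = TRUE, so they should be reported separately.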
losAngeles <- air %>%
  filter(County == "Los Angeles")
ggplot(losAngeles, aes(x = site, y = PM2.5, color = as.factor(Year))) +
  stat_summary(fun = mean, geom = "point", size = 3, na.rm = TRUE) +
  stat_summary(fun = mean, geom = "line", aes(group = Year), na.rm = TRUE) +
  labs(x = "Site in LA",
       y = "Average PM2.5 Concentration (µg/m³)",
       title = "Average PM2.5 by Site in Los Angeles") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
This line graph shows the difference in average PM2.5 by year at the sites in LA. 2002 consistently has a higher average than 2022.
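A site-level comparison is most direct when it is limited to sites that reported in both years. Whether every Los Angeles site did so is not shown above, so the following base-R sketch treats that as an open question; la_toy, its site labels and its values are hypothetical.

```r
# Toy stand-in for the Los Angeles subset (site labels and values are made up)
la_toy <- data.frame(
  site = c("Site A", "Site A", "Site A", "Site B", "Site B", "Site C"),
  Year = c(2002, 2002, 2022, 2002, 2022, 2022),
  PM2.5 = c(22, 18, 10, 24, 12, 5)
)

# Mean PM2.5 per site and year
site_means <- aggregate(PM2.5 ~ site + Year, data = la_toy, FUN = mean)

# One row per site, one column per year (PM2.5.2002, PM2.5.2022)
wide <- reshape(site_means, idvar = "site", timevar = "Year",
                direction = "wide")

# Keep only sites with a mean in both years, then compute the change
paired <- wide[complete.cases(wide), ]
paired$change <- paired$PM2.5.2022 - paired$PM2.5.2002
print(paired)
```

Here "Site C" is dropped because it has no 2002 data, so the remaining rows compare like with like.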
This homework has been adapted from the case study in Roger Peng’s Exploratory Data Analysis with R